internal covariate shift
Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models
Ioffe, Sergey
Batch Normalization is quite effective at accelerating and improving the training of deep models. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. We hypothesize that this is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference. We propose Batch Renormalization, a simple and effective extension to ensure that the training and inference models generate the same outputs that depend on individual examples rather than the entire minibatch. Models trained with Batch Renormalization perform substantially better than batchnorm when training with small or non-i.i.d. minibatches.
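For reference, the correction the abstract describes can be sketched in a few lines: the minibatch statistics are pulled toward the running (inference) statistics through two clipped, gradient-stopped factors r and d, and inference then uses the running statistics exactly as in standard batchnorm. The clipping limits, momentum value, and function names below are illustrative choices, not taken from the abstract.

```python
import numpy as np

def batch_renorm_train(x, gamma, beta, running_mean, running_var,
                       r_max=3.0, d_max=5.0, momentum=0.99, eps=1e-5):
    """One training step of Batch Renormalization on an (N, C) activation batch."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    batch_std = np.sqrt(batch_var + eps)
    running_std = np.sqrt(running_var + eps)

    # Correction factors tying minibatch statistics to the running ones.
    # In a real framework these would be wrapped in a stop-gradient.
    r = np.clip(batch_std / running_std, 1.0 / r_max, r_max)
    d = np.clip((batch_mean - running_mean) / running_std, -d_max, d_max)

    x_hat = (x - batch_mean) / batch_std * r + d
    y = gamma * x_hat + beta

    # Update running statistics for inference.
    new_mean = momentum * running_mean + (1 - momentum) * batch_mean
    new_var = momentum * running_var + (1 - momentum) * batch_var
    return y, new_mean, new_var

def batch_renorm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference uses the running statistics only, exactly like standard batchnorm."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta
```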
Rethinking Layer-wise Model Merging through Chain of Merges
Buzzega, Pietro; Salami, Riccardo; Porrello, Angelo; Calderara, Simone
Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized models increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques operate at the level of individual layers, thereby overlooking the inter-layer dependencies inherent in deep networks. We show that this simplification leads to distributional mismatches, particularly in methods that rely on intermediate activations, as changes in early layers are not properly propagated to downstream layers during merging. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural network training. To address this, we propose Chain of Merges (CoM), a layer-wise procedure that merges weights sequentially across layers while updating activation statistics at each step. By explicitly accounting for inter-layer interactions, CoM mitigates covariate shift and produces a coherent merged model through a series of conditionally optimal updates. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.
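The abstract's central point, that each layer should be merged against activations produced by the already-merged layers below it rather than stale per-model statistics, can be sketched schematically as below. This is not the authors' exact procedure: chain_merge, merge_layer, and average_layers are hypothetical names, and the parameter-averaging rule shown is only a stand-in for whatever activation-aware per-layer merging rule is actually used.

```python
import copy
import torch

@torch.no_grad()
def chain_merge(models, calib_batch, merge_layer):
    """Merge architecturally identical nn.Sequential models layer by layer,
    refreshing activation statistics with the merged prefix after every step."""
    merged_prefix_out = calib_batch                 # activations of the merged prefix
    per_model_out = [calib_batch for _ in models]   # activations inside each source model
    merged_layers = []

    for idx in range(len(models[0])):
        layers = [m[idx] for m in models]
        # Merge layer idx conditioned on activations that already reflect the
        # merged layers below it, which is what counters the covariate shift
        # described in the abstract.
        merged = merge_layer(layers, per_model_out, merged_prefix_out)
        merged_layers.append(merged)

        # Propagate activations so the next merge sees up-to-date statistics.
        per_model_out = [layer(x) for layer, x in zip(layers, per_model_out)]
        merged_prefix_out = merged(merged_prefix_out)

    return torch.nn.Sequential(*merged_layers)

def average_layers(layers, per_model_out, merged_prefix_out):
    """Trivial per-layer rule (plain parameter averaging); ignores the activations."""
    merged = copy.deepcopy(layers[0])
    for p_merged, *ps in zip(merged.parameters(), *[l.parameters() for l in layers]):
        p_merged.copy_(torch.stack(ps).mean(dim=0))
    return merged
```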
Mask-PINNs: Mitigating Internal Covariate Shift in Physics-Informed Neural Networks
Jiang, Feilong; Hou, Xiaonan; Ye, Jianqiao; Xia, Min
Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving partial differential equations (PDEs) by embedding physical laws directly into the loss function. However, as a fundamental optimization issue, internal covariate shift (ICS) hinders the stable and effective training of PINNs by disrupting feature distributions and limiting model expressiveness. Unlike in standard deep learning tasks, conventional remedies for ICS -- such as Batch Normalization and Layer Normalization -- are not directly applicable to PINNs, as they distort the physical consistency required for reliable PDE solutions. To address this issue, we propose Mask-PINNs, a novel architecture that introduces a learnable mask function to regulate feature distributions while preserving the underlying physical constraints of PINNs. We provide a theoretical analysis showing that the mask suppresses the expansion of feature representations through a carefully designed modulation mechanism. Empirically, we validate the method on multiple PDE benchmarks -- including convection, wave propagation, and Helmholtz equations -- across diverse activation functions. Our results show consistent improvements in prediction accuracy, convergence stability, and robustness. Furthermore, we demonstrate that Mask-PINNs enable the effective use of wider networks, overcoming a key limitation in existing PINN frameworks.
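The abstract does not spell out the form of the mask function, so the following is only a guess at the general idea: a learnable, per-feature, strictly positive mask that rescales each hidden activation, leaving the PDE-residual loss itself untouched. MaskedLayer, MaskPINN, and the softplus parameterization are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLayer(nn.Module):
    """Hidden layer whose output is rescaled by a learnable per-feature mask."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.mask_logits = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        h = torch.tanh(self.linear(x))
        # A positive mask rescales features without flipping their sign,
        # keeping the network smooth for the PDE-residual derivatives.
        return h * F.softplus(self.mask_logits)

class MaskPINN(nn.Module):
    """Plain MLP PINN body built from masked hidden layers."""
    def __init__(self, in_dim=2, hidden=64, depth=4, out_dim=1):
        super().__init__()
        layers = [MaskedLayer(in_dim, hidden)]
        layers += [MaskedLayer(hidden, hidden) for _ in range(depth - 1)]
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x):
        return self.head(self.body(x))
```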
Reviews: Revisit Fuzzy Neural Network: Demystifying Batch Normalization and ReLU with Generalized Hamming Network
The authors use a notion of generalized Hamming distance to shed light on the success of Batch Normalization and ReLU units. After reading the paper, I am still very confused about its contribution. The authors claim that generalized Hamming distance offers a better view of Batch Normalization and ReLUs, and explain this in two paragraphs on pages 4 and 5. The explanation for Batch Normalization is essentially contained in the following phrase: "It turns out BN is indeed attempting to compensate for deficiencies in neuron outputs with respect to GHD. This surprising observation indeed adheres to our conjecture that an optimized neuron should faithfully measure the GHD between inputs and weights."
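For context on the quantity the quoted passage refers to: assuming the paper's elementwise fuzzy-XOR definition g(a, b) = a + b - 2ab, the generalized Hamming distance between an input x and a weight vector w is sum(x) + sum(w) - 2 w.x, so a linear neuron whose bias is set to -0.5 * (sum(x) + sum(w)) outputs exactly -GHD/2. That input-dependent bias term appears to be the "compensation" the quoted conjecture attributes to BN. A minimal sketch of the identity:

```python
import numpy as np

def ghd(x, w):
    """Generalized Hamming distance: elementwise fuzzy XOR a + b - 2ab, summed."""
    return np.sum(x + w - 2.0 * x * w)

rng = np.random.default_rng(0)
x = rng.random(8)
w = rng.random(8)

# A linear neuron with the GHD-induced bias measures -GHD/2 exactly.
neuron_out = w @ x - 0.5 * (x.sum() + w.sum())
assert np.isclose(neuron_out, -0.5 * ghd(x, w))
```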